Speech coding/decoding using phase spectrum corresponding to a transfer function having at least one pole outside the unit circle

ABSTRACT

A decoder for speech signals receives magnitude spectral information for synthesis of a time-varying signal. From the magnitude spectral information, phase spectrum information is computed corresponding to a minimum phase filter which has a magnitude spectrum corresponding to the magnitude spectral information. From the magnitude spectral information and the phase spectral information, a time-varying signal is generated. The phase spectrum of the signal is modified by phase adjustment.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention is concerned with speech coding and decoding, and especially with systems in which the coding process fails to convey all or any of the phase information contained in the signal being coded.

2. Related Art

A known speech coder and decoder is shown in FIG. 1 and is further discussed below. However, such prior art is based on assumptions regarding the phase spectrum which can be further improved.

SUMMARY OF THE INVENTION

According to one aspect of the present invention there is provided a decoder for speech signals comprising:

means for receiving magnitude spectral information for synthesis of a time-varying signal;

means for computing, from the magnitude spectral information, phase spectrum information corresponding to a minimum phase filter which has a magnitude spectrum corresponding to the magnitude spectral information;

means for generating, from the magnitude spectral information and the phase spectral information, the time-varying signal; and

phase adjustment means operable to modify the phase spectrum of the signal.

In another aspect the invention provides a decoder for decoding speech signals comprising information defining the response of a minimum phase synthesis filter and, for synthesis of an excitation signal, magnitude spectral information, the decoder comprising:

means for generating, from the magnitude spectral information, an excitation signal;

a synthesis filter controlled by the response information and connected to filter the excitation signal; and

phase adjustment means for estimating a phase-adjustment signal to modify the phase of the signal.

In a further aspect, the invention provides a method of coding and decoding speech signals, comprising:

(a) generating signals representing the magnitude spectrum of the speech signal;

(b) receiving the signals;

(c) generating from the received signals a synthetic speech signal having a magnitude spectrum determined by the received signals and having a phase spectrum which corresponds to a transfer function having, when considered as a z-plane plot, at least one pole outside the unit circle.

BRIEF DESCRIPTION OF DRAWINGS

Some embodiments of the invention will now be described, by way of example, with reference to the accompanying drawings, in which:

FIG. 1 is a block diagram of a known speech coder and decoder;

FIG. 2 illustrates a model of the human vocal system;

FIG. 3 is a block diagram of a speech decoder according to one embodiment of the present invention;

FIGS. 4 and 5 are charts showing test results obtained for the decoder of FIG. 3;

FIG. 6 is a graph of the shape of a (known) Rosenberg pulse;

FIG. 7 is a block diagram of a second form of speech decoder according to the invention;

FIG. 8 is a block diagram of a known type of speech coder;

FIG. 9 is a block diagram of a third embodiment of decoder in accordance with the invention, for use with the coder of FIG. 8; and

FIG. 10 is a z-plane plot illustrating the invention.

DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS

This first example assumes that a sinusoidal transform coding (STC) technique is employed for the coding and decoding of speech signals. This technique was proposed by McAulay and Quatieri and is described in their paper “Speech Analysis/Synthesis based on a Sinusoidal Representation”, R. J. McAulay and T. F. Quatieri, IEEE Trans. Acoust. Speech Signal Process. ASSP-34, pp. 744-754, 1986; and “Low-rate Speech Coding based on the Sinusoidal Model” by the same authors, in “Advances in Speech Signal Processing”, Ed. S. Furui and M. M. Sondhi, Marcel Dekker Inc., 1992. The principles are illustrated in FIG. 1, where a coder receives speech samples s(n) in digital form at an input 1; segments of speech of typically 20 ms duration are subject to Fourier analysis in a Fast Fourier Transform unit 2 to determine the short term frequency spectrum of the speech. Specifically it is the amplitudes and frequencies of the peaks in the magnitude spectrum that are of interest, the frequencies being assumed (in the case of voiced speech) to be harmonics of a pitch frequency which is derived by a pitch detector 3. The phase spectrum is, in the interests of transmission efficiency, not to be transmitted, and a representation of the magnitude spectrum, for transmission to a decoder, is in this example obtained by fitting an envelope to the magnitude spectrum and characterising this envelope by a set of coefficients (e.g. LSP (line spectral pair) coefficients). This function is performed by a conversion unit 4, which receives the Fourier coefficients and performs the curve fit, and a unit 5, which converts the envelope to LSP coefficients which form the output of the coder.

The corresponding decoder is also shown in FIG. 1. This receives the envelope information, but, lacking the phase information, has to reconstruct the phase spectrum based on some assumption. The assumption used is that the magnitude spectrum represented by the received LSP coefficients is the magnitude spectrum of a minimum-phase transfer function, which amounts to the assumption that the human vocal system can be regarded as a minimum phase filter impulsively excited. Thus a unit 6 derives the magnitude spectrum from the received LSP coefficients and a unit 7 calculates the phase spectrum which corresponds to this magnitude spectrum based on the minimum phase assumption. From the two spectra a sinusoidal synthesiser 8 generates the sum of a set of sinusoids, harmonic with the pitch frequency, having amplitudes and phases determined by the spectra.
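The text does not say how unit 7 obtains the minimum-phase spectrum from the magnitude spectrum. One standard possibility, shown here purely as an illustrative sketch and not as the method of the described decoder, is the cepstral (Hilbert transform) relation: the real cepstrum of the log magnitude is folded onto positive time, and the imaginary part of its transform is the minimum phase. The function name `minimum_phase_from_magnitude` and the use of a plain O(N²) DFT are assumptions made for the example.

```python
import cmath
import math

def minimum_phase_from_magnitude(mag):
    """Return the minimum-phase spectrum (radians) for a magnitude spectrum.

    `mag` holds |H| sampled at N equally spaced frequencies around the whole
    unit circle (N even, all values > 0). Hypothetical helper, cepstral method.
    """
    n_pts = len(mag)
    half = n_pts // 2
    log_mag = [math.log(m) for m in mag]
    # Real cepstrum: inverse DFT of the log magnitude.
    cep = [sum(log_mag[k] * cmath.exp(2j * math.pi * k * n / n_pts)
               for k in range(n_pts)).real / n_pts
           for n in range(n_pts)]
    # Fold the even cepstrum into a causal one (the minimum-phase condition).
    folded = [0.0] * n_pts
    folded[0] = cep[0]
    for n in range(1, half):
        folded[n] = 2.0 * cep[n]
    folded[half] = cep[half]
    # Forward DFT gives log|H| + j*phase; keep the imaginary part.
    return [sum(folded[n] * cmath.exp(-2j * math.pi * k * n / n_pts)
                for n in range(n_pts)).imag
            for k in range(n_pts)]

# Check against a known minimum-phase filter H(z) = 1 / (1 - 0.5 z^-1).
N = 64
H = [1.0 / (1.0 - 0.5 * cmath.exp(-2j * math.pi * k / N)) for k in range(N)]
phase = minimum_phase_from_magnitude([abs(h) for h in H])
max_err = max(abs(phase[k] - cmath.phase(H[k])) for k in range(N))
```

For this single-pole test filter the recovered phase agrees with the true phase to within numerical precision, because the filter really is minimum phase; the remainder of the description concerns the case where that assumption does not hold.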

In sinusoidal speech synthesis, a synthetic speech signal y(n) is constructed by the sum of sine waves:

$$y(n) = \sum_{k=1}^{N} A_k \cos\left(\omega_k n + \varphi_k\right) \tag{1}$$

where A_k and φ_k represent the amplitude and phase of each sine wave component associated with the frequency track ω_k, and N is the number of sinusoids.

Although this is not a prerequisite, it is common to assume that the sinusoids are harmonically related, thus:

$$y(n) = \sum_{k=1}^{N} A_k \cos\left(\psi_k + \varphi_k\right) \tag{2}$$

where

$$\psi_k(n) = k\,\omega_0(n)\,n \tag{3}$$

and φ_k(n) represents the instantaneous relative phase of the harmonics, ψ_k(n) represents the instantaneous linear phase component, and ω₀(n) is the instantaneous fundamental pitch frequency.

A simple example of sinusoidal synthesis is the overlap and add technique. In this scheme A_k(n), ω₀(n) and φ_k(n) are updated periodically, and are assumed to be constant for the duration of a short frame, for example 10 ms. The i'th signal frame is thus synthesised as follows:

$$y^{i}(n) = \sum_{k=1}^{N^{i}} A_k^{i} \cos\left(k\,\omega_0^{i}\,n + \varphi_k^{i}\right) \tag{4}$$

Note that this is essentially an inverse discrete Fourier transform. Discontinuities at frame boundaries are avoided by combining adjacent frames as follows:

$$\hat{y}^{i}(n) = W(n)\,y^{i-1}(n) + W(n-T)\,y^{i}(n-T) \tag{5}$$

where W(n) is an overlap and add window, for example triangular or trapezoidal, T is the frame duration expressed as a number of sample periods and

$$W(n) + W(n-T) = 1 \tag{6}$$
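The overlap and add scheme of Equations 4 to 6 can be sketched as follows. This is a minimal illustration assuming a triangular window W(m) = 1 − |m|/T; the function names and the representation of a frame as an `(amps, phases, w0)` tuple are hypothetical, with `w0` in radians per sample.

```python
import math

def frame_sample(amps, phases, w0, m):
    """Equation 4: value of one frame at local sample index m (may be negative)."""
    return sum(a * math.cos(k * w0 * m + ph)
               for k, (a, ph) in enumerate(zip(amps, phases), start=1))

def window(m, T):
    """Triangular overlap-add window; satisfies W(n) + W(n - T) = 1 for 0 <= n <= T."""
    return max(0.0, 1.0 - abs(m) / T)

def overlap_add(prev, cur, T):
    """Equation 5 for n = 0 .. T-1; prev and cur are (amps, phases, w0) tuples."""
    return [window(n, T) * frame_sample(*prev, n)
            + window(n - T, T) * frame_sample(*cur, n - T)
            for n in range(T)]

# Two frames describing the same steady sinusoid: the second frame's phase is
# advanced by w0*T so that the cross-fade reproduces the sinusoid exactly.
T = 80
w0 = 0.2
prev = ([1.0], [0.3], w0)
cur = ([1.0], [0.3 + w0 * T], w0)
out = overlap_add(prev, cur, T)
```

When the two frames differ, the same cross-fade moves smoothly from the old parameters to the new ones over T samples.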

In an alternative approach, y(n) may be calculated continuously by interpolating the amplitude and phase terms in equation 2. In such schemes, the magnitude component A_k(n) is often interpolated linearly between updates, whilst a number of techniques have been reported for interpolating the phase component. In one approach (McAulay and Quatieri) the instantaneous combined phase (ψ_k(n)+φ_k(n)) and pitch frequency ω₀(n) are specified at each update point. The interpolated phase trajectory can then be represented by a cubic polynomial. In another approach (Kleijn) ψ_k(n) and φ_k(n) are interpolated separately. In this case φ_k(n) is specified directly at the update points and linearly interpolated, whilst the instantaneous linear phase component ψ_k(n) is specified at the update points in terms of the pitch frequency ω₀(n), and only requires a quadratic polynomial interpolation.

From the discussion presented above, it is clear that a sinusoidal synthesiser can be generalised as a unit that produces a continuous signal y(n) from periodically updated values of A_k(n), ω₀(n) and φ_k(n). The number of sinusoids may be fixed or time-varying.

Thus we are interested in sinusoidal synthesis schemes where the original phase information is unavailable and φ_k must be derived in some manner at the synthesiser.

Whilst the system of FIG. 1 produces reasonably satisfactory results, the coder and decoder now to be described offer alternative assumptions as to the phase spectrum. The notion that the human vocal apparatus can be viewed as an impulsive excitation e(n) consisting of a regular series of delta functions driving a time-varying filter H(z) (where z is the z-transform variable) can be refined by considering H(z) to be formed by three filters, as illustrated in FIG. 2, namely a glottal filter 20 having a transfer function G(z), a vocal tract filter 21 having a transfer function V(z) and a lip radiation filter 22 with a transfer function L(z). In this description, the time-domain representations of variables and the impulse responses of filters are shown in lower case, whilst their z-transforms and frequency domain representations are denoted by the same letters in upper case. Thus we may write for the speech signal s(n):

$$s(n) = e(n) \otimes h(n) = e(n) \otimes g(n) \otimes v(n) \otimes l(n) \tag{7}$$

or

$$S(z) = E(z)H(z) = E(z)G(z)V(z)L(z) \tag{8}$$

Since the spectrum of e(n) is a series of lines at the pitch frequency harmonics, it follows that at the frequency of each harmonic the magnitude of s is:

$$\left|S(e^{j\omega})\right| = \left|E(e^{j\omega})\right|\left|H(e^{j\omega})\right| = A\left|H(e^{j\omega})\right| \tag{9}$$

where A is a constant determined by the amplitude of e(n), and the phase is:

$$\arg\left(S(e^{j\omega})\right) = \arg\left(E(e^{j\omega})\right) + \arg\left(H(e^{j\omega})\right) = 2m\pi + \arg\left(H(e^{j\omega})\right) \tag{10}$$

where m is any integer.

Assuming that the magnitude spectrum at the decoder of FIG. 1 corresponds to |H(e^{jω})|, the regenerated speech will be degraded to the extent that the phase spectrum used differs from arg(H(e^{jω})).

Considering now the components G, V and L, minimum phase is a good assumption for the vocal tract transfer function V(z). Typically this may be represented by an all-pole model having the transfer function

$$V(z) = \frac{1}{\prod_{i=1}^{P}\left(1 - \rho_i z^{-1}\right)} \tag{11}$$

where ρ_i are the poles of the transfer function and are directly related to the formant frequencies of the speech, and P is the number of poles.

The lip radiation filter may be regarded as a differentiator for which:

$$L(z) = 1 - \alpha z^{-1} \tag{12}$$

where α represents a single zero having a value close to unity (typically 0.95).

Whilst the minimum phase assumption is good for V(z) and L(z), it is believed to be less valid for G(z). Noting that any filter transfer function can be represented as the product of a minimum phase function and an all pass filter, we may suppose that:

$$G(z) = G_{\min}(z)\,G_{ap}(z) \tag{13}$$

The decoder shortly to be described with reference to FIG. 3 is based on the assumption that the magnitude spectrum associated with G is that corresponding to

$$G_{\min}(z) = \frac{1}{\prod_{i=1}^{2}\left(1 - \beta_i z^{-1}\right)} \tag{14}$$

The decoder proceeds on the assumption that an appropriate transfer function for G_ap is

$$G_{ap}(z) = \frac{\left(1 - \beta_1 z^{-1}\right)\left(1 - \beta_2 z^{-1}\right)}{\left(1 - \frac{1}{\beta_1} z^{-1}\right)\left(1 - \frac{1}{\beta_2} z^{-1}\right)} \tag{15}$$
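As a brief check, not part of the original derivation, that the function of Equation 15 is indeed all-pass, consider one factor with real β on the unit circle z = e^{jω}:

$$\left|1 - \beta e^{-j\omega}\right|^2 = 1 - 2\beta\cos\omega + \beta^2, \qquad \left|1 - \tfrac{1}{\beta} e^{-j\omega}\right|^2 = \frac{\beta^2 - 2\beta\cos\omega + 1}{\beta^2}$$

so that

$$\frac{\left|1 - \beta e^{-j\omega}\right|}{\left|1 - \tfrac{1}{\beta} e^{-j\omega}\right|} = |\beta| \quad\text{and hence}\quad \left|G_{ap}(e^{j\omega})\right| = \left|\beta_1 \beta_2\right| \text{ for all } \omega.$$

The magnitude is therefore a constant gain, and only the phase of G_ap varies with frequency.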

The corresponding phase spectrum for G_ap is

$$\begin{aligned}\varphi_F(\omega) &= \arg\left(G_{ap}(e^{j\omega})\right) \\ &= \tan^{-1}\left(\frac{\beta_1 \sin\omega}{1 - \beta_1\cos\omega}\right) + \tan^{-1}\left(\frac{\beta_2 \sin\omega}{1 - \beta_2\cos\omega}\right) \\ &\quad - \tan^{-1}\left(\frac{\sin\omega}{\beta_1 - \cos\omega}\right) - \tan^{-1}\left(\frac{\sin\omega}{\beta_2 - \cos\omega}\right)\end{aligned} \tag{16}$$

In the decoder of FIG. 3, items 6, 7 and 8 are as in FIG. 1. However, the phase spectrum computed at 7 is adjusted. A unit 31 receives the pitch frequency and calculates values of φ_F in accordance with Equation (16) for the relevant values of ω, i.e. harmonics of the pitch frequency for the current frame of speech. These are then added in an adder 32 to the minimum-phase values, prior to the sinusoidal synthesiser 8.
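The operation of unit 31 and adder 32 can be sketched as below. This is an illustrative sketch only: the function names are hypothetical, the pitch frequency `w0` is assumed to be in radians per sample, and the inverse tangents of the all-pass phase formula are evaluated with the four-quadrant `atan2` (an implementation choice, so that terms such as β − cos ω may change sign without error). The result may differ from the principal value of arg G_ap by a multiple of 2π, which is immaterial for synthesis.

```python
import cmath
import math

def allpass_phase(omega, beta1, beta2):
    """Phase adjustment phi_F(omega) for real beta1, beta2 in (0, 1)."""
    s, c = math.sin(omega), math.cos(omega)
    return (math.atan2(beta1 * s, 1.0 - beta1 * c)
            + math.atan2(beta2 * s, 1.0 - beta2 * c)
            - math.atan2(s, beta1 - c)
            - math.atan2(s, beta2 - c))

def allpass_response(omega, beta1, beta2):
    """G_ap evaluated on the unit circle, used here only as a cross-check."""
    zi = cmath.exp(-1j * omega)
    return ((1 - beta1 * zi) * (1 - beta2 * zi)) / ((1 - zi / beta1) * (1 - zi / beta2))

def adjust_phases(min_phases, w0, beta1=0.8, beta2=0.8):
    """Adder 32: add phi_F at each pitch harmonic k*w0 to the minimum-phase value."""
    return [ph + allpass_phase(k * w0, beta1, beta2)
            for k, ph in enumerate(min_phases, start=1)]

# Example: ten harmonics of a pitch frequency of 0.25 rad/sample.
adjusted = adjust_phases([0.0] * 10, 0.25)
```

The defaults β₁ = β₂ = 0.8 are the fixed values used in the experiments reported in the text.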

Experiments were conducted on the decoder of FIG. 3, with a fixed value β₁=β₂=0.8 (though, as will be discussed below, varying β is also possible). These showed an improvement in measured phase error (as shown in FIG. 4) and also in subjective tests (FIG. 5) in which listeners were asked to listen to the output of four decoders and place them in order of preference for speech quality. The choices were scored: first choice=4, second=3, third=2 and fourth=1; and the scores added.

The results include figures for a Rosenberg pulse. As described by A. E. Rosenberg in “Effect of Glottal Pulse Shape on the Quality of Natural Vowels”, J. Acoust. Soc. of America. Vol. 49, No. 2, 1971, pp. 583-590, this is a pulse shape postulated for the output of the glottal filter G. The shape of a Rosenberg pulse is shown in FIG. 6 and is defined as:

$$g(t) = \begin{cases} A\left(3\left(t/T_P\right)^2 - 2\left(t/T_P\right)^3\right) & 0 \le t \le T_P \\ A\left(1 - \left(\dfrac{t - T_P}{T_N}\right)^2\right) & T_P < t \le T_P + T_N \\ 0 & T_P + T_N < t \le p \end{cases} \tag{17}$$

where p is the pitch period and T_P and T_N are the glottal opening and closing times respectively.
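The piecewise definition of the Rosenberg pulse translates directly into code. The following is a minimal sketch; the function and parameter names are hypothetical, and times are taken to be in the same units as the pitch period p.

```python
def rosenberg_pulse(t, p, t_open, t_close, amplitude=1.0):
    """Rosenberg glottal pulse g(t) over one pitch period 0 <= t <= p.

    t_open and t_close stand for T_P and T_N (opening and closing times).
    """
    if 0.0 <= t <= t_open:
        x = t / t_open
        return amplitude * (3.0 * x ** 2 - 2.0 * x ** 3)
    if t_open < t <= t_open + t_close:
        x = (t - t_open) / t_close
        return amplitude * (1.0 - x ** 2)
    return 0.0  # closed phase, up to t = p

# One period of 100 samples with T_P = 33 and T_N = 10 samples.
period = [rosenberg_pulse(float(n), 100.0, 33.0, 10.0) for n in range(100)]
```

The pulse rises smoothly from zero to its peak A at t = T_P, falls back to zero at t = T_P + T_N, and stays at zero for the rest of the period.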

An alternative to Equation 16, therefore, is to apply at 31 a computed phase equal to the phase of g(t) from Equation (17), as shown in FIG. 7. However, in order that the component of the Rosenberg pulse spectrum that can be represented by a minimum phase transfer function is not applied twice, the magnitude spectrum corresponding to Equation 17 is calculated at 71 and subtracted from the amplitude values before they are processed by the phase spectrum calculation unit 7. The results given are for T_P=0.33 p, T_N=0.1 p.

The same considerations may be applied to arrangements in which a coder attempts to deconvolve the glottal excitation and the vocal tract response, so-called linear predictive coders. Here (FIG. 8) input speech is analysed (60) frame by frame to determine parameters of a filter having a spectral response similar to that of the input speech. The coder then sets up a filter 61 having the inverse of this response and the speech signal is passed through this inverse filter to produce a residual signal r(n) which ideally would have a flat spectrum and which in practice is flatter than that of the original speech. The coder transmits details of the filter response, along with information (63) to enable the decoder to construct (64) an excitation signal which is to some extent similar to the residual signal and can be used by the decoder to drive a synthesis filter 65 to produce an output speech signal. Many proposals have been made for different ways of transmitting the residual information, e.g.

(a) sending for voiced speech a pitch period and gain value to control a pulse generator and for unvoiced speech a gain value to control a noise generator;

(b) a quantised version of the residual (RELP coding);

(c) a vector-quantised version of the residual (CELP coding);

(d) a coded representation of an irregular pulse train (MPLPC coding);

(e) particulars of a single cycle of the residual by which the decoder may synthesise a repeating sequence of frame length (Prototype waveform interpolation or PWI) (see W. B. Kleijn, “Encoding Speech using prototype Waveforms”, IEEE Trans. Speech and Audio Processing, Vol 1, No. 4, October 1993, pp. 386-399, and W. B. Kleijn and J. Haagen, “A Speech Coder based on Decomposition of Characteristic Waveforms”, Proc ICASSP, 1995, pp. 508-511).
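The relationship between the inverse filter 61 and the synthesis filter 65 described above can be sketched as follows. This is an illustration under stated assumptions: the predictor coefficients `a` are simply given (the analysis step 60 is not shown), the sign convention r(n) = s(n) − Σ a_k s(n−k) is one common choice rather than something specified in the text, and the function names are hypothetical.

```python
def inverse_filter(s, a):
    """Inverse (analysis) filter: r(n) = s(n) - sum_k a[k-1] * s(n-k)."""
    return [s[n] - sum(a[k - 1] * s[n - k]
                       for k in range(1, len(a) + 1) if n - k >= 0)
            for n in range(len(s))]

def synthesis_filter(r, a):
    """All-pole synthesis filter: y(n) = r(n) + sum_k a[k-1] * y(n-k)."""
    y = []
    for n in range(len(r)):
        y.append(r[n] + sum(a[k - 1] * y[n - k]
                            for k in range(1, len(a) + 1) if n - k >= 0))
    return y

# With the exact residual as excitation, synthesis undoes the inverse filter.
speech = [1.0, 0.5, -0.25, 0.75, 0.0, -1.0, 0.3, 0.2]
coeffs = [1.2, -0.5]  # illustrative second-order predictor
residual = inverse_filter(speech, coeffs)
rebuilt = synthesis_filter(residual, coeffs)
```

In a real coder the decoder does not have the exact residual; schemes (a) to (e) differ in how closely the constructed excitation approximates it.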

In the event that the phase information about the excitation is omitted from the transmission, then a similar situation arises to that described in relation to FIG. 2, namely that assumptions need to be made as to the phase spectrum to be employed. Whether phase information for the synthesis filter is included is not an issue, since LPC analysis generally produces a minimum phase transfer function in any case, so that it is immaterial for the purposes of the present discussion whether the phase response is included in the transmitted filter information (typically a set of filter coefficients) or whether it is computed at the decoder on the basis of a minimum phase assumption.

Of particular interest in this context are PWI coders where commonly the extracted prototypical residual pitch cycle is analysed using a Fourier transform. Rather than simply quantising the Fourier coefficients, a saving in transmission capacity can be made by sending only the magnitude and the pitch period. Thus in the arrangement of FIG. 9, where items identical to those in FIG. 8 carry the same reference numerals, the excitation unit 63 (here operating according to the PWI principle and producing at its output sets of Fourier coefficients) is followed by a unit 80 which extracts only the magnitude information and transmits this to the decoder. At the decoder a unit 91, analogous to unit 31 in FIG. 3, calculates the phase adjustment values φ_F using Equation 16 and controls the phase of an excitation generator 64. In this example, β₁ is fixed at 0.95 whilst β₂ is controlled as a function of the pitch period p, in accordance with the following table:

TABLE I

Pitch      β       Pitch      β
16-52      0.64    82-84      0.84
53-54      0.65    85-87      0.85
54-56      0.66    88-89      0.86
57-59      0.70    90-93      0.87
60-62      0.71    94-99      0.88
63-64      0.75    100-102    0.89
65-68      0.76    103-107    0.90
69         0.78    108-114    0.91
70-72      0.79    115-124    0.92
73-74      0.80    125-132    0.93
75-79      0.82    133-144    0.94
80-81      0.83    145-150    0.95

The value of α used in F(z) for the range of pitch periods

These values are chosen so that the all-pass transfer function of Equation 15 has a phase response equivalent to that part of the phase spectrum of a Rosenberg pulse having T_P=0.4 p and T_N=0.16 p which is not modelled by the LPC synthesis filter 65. As before, the adjustment is added in an adder 83, and the result is converted back into Fourier coefficients before passing to the PWI excitation generator 64.
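The pitch-dependent choice of β₂ amounts to a table lookup, which might be sketched as below. The names `BETA2_TABLE` and `beta2_for_pitch` are hypothetical. Note that Table I as printed lists a pitch of 54 in two rows (53-54 and 54-56); this sketch simply returns the first matching row, which is an arbitrary choice and not something the text specifies.

```python
# (lowest pitch, highest pitch, beta) rows transcribed from Table I.
BETA2_TABLE = [
    (16, 52, 0.64), (53, 54, 0.65), (54, 56, 0.66), (57, 59, 0.70),
    (60, 62, 0.71), (63, 64, 0.75), (65, 68, 0.76), (69, 69, 0.78),
    (70, 72, 0.79), (73, 74, 0.80), (75, 79, 0.82), (80, 81, 0.83),
    (82, 84, 0.84), (85, 87, 0.85), (88, 89, 0.86), (90, 93, 0.87),
    (94, 99, 0.88), (100, 102, 0.89), (103, 107, 0.90), (108, 114, 0.91),
    (115, 124, 0.92), (125, 132, 0.93), (133, 144, 0.94), (145, 150, 0.95),
]

BETA1 = 0.95  # fixed in this embodiment

def beta2_for_pitch(pitch):
    """Return the tabulated beta for a pitch period, first matching row wins."""
    for low, high, beta in BETA2_TABLE:
        if low <= pitch <= high:
            return beta
    raise ValueError(f"pitch period {pitch} is outside the tabulated range 16-150")

example = beta2_for_pitch(100)
```

The returned value, together with the fixed β₁, would then be supplied to the phase calculation of unit 91.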

The calculation unit 91 may be realised by a digital signal processing unit programmed to implement Equation 16.

It is of interest to consider the effect of these adjustments in terms of poles and zeros on the z-plane. The supposed total transfer function H(z) is the product of G, V and L and thus has, inside the unit circle, P poles at ρ_i and one zero at α, and, outside the unit circle, two poles at 1/β₁ and 1/β₂, as illustrated in FIG. 10. The effect of the inverse LPC analysis is to produce an inverse filter 61 which flattens the spectrum by means of zeros approximately coinciding with the poles at ρ_i. The filter, being a minimum phase filter, cannot produce zeros outside the unit circle at 1/β₁ and 1/β₂ but instead produces zeros at β₁ and β₂, which tend to flatten the magnitude response, but not the phase response (the filter cannot produce a pole to cancel the zero at α, but as β₁ usually has a similar value to α it is common to assume that the α zero and 1/β₁ pole cancel in the magnitude spectrum so that the inverse filter has zeros just at ρ_i and β₁). Thus the residual has a phase spectrum represented in the z-plane by two zeros at β₁ and β₂ (where the β's have values corresponding to the original signal) and poles at 1/β₁ and 1/β₂ (where the β's have values as determined by the LPC analysis). This information having been lost, it is approximated by the all-pass filter computation according to equations (15) and (16) which have zeros and poles at these positions.

This description assumes a phase adjustment determined at all frequencies by Equation 16. However one may alternatively apply Equation 16 only in the lower part of the frequency range, up to a limit which may be fixed or may depend on the nature of the speech, and apply a random phase to higher frequency components.
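The band-limited variant just described might be sketched as follows. Several points are assumptions made for illustration: the function names, the `cutoff` parameter (radians per sample), the uniform distribution on [−π, π], and the interpretation that above the limit the random value replaces the whole harmonic phase rather than only the adjustment term.

```python
import math
import random

def allpass_phase(omega, beta1, beta2):
    """All-pass phase adjustment for real beta1, beta2, evaluated with atan2."""
    s, c = math.sin(omega), math.cos(omega)
    return (math.atan2(beta1 * s, 1.0 - beta1 * c)
            + math.atan2(beta2 * s, 1.0 - beta2 * c)
            - math.atan2(s, beta1 - c)
            - math.atan2(s, beta2 - c))

def harmonic_phases(min_phases, w0, cutoff, beta1, beta2, rng):
    """Phase for each harmonic k*w0: adjusted below the cutoff, random above it."""
    phases = []
    for k, ph in enumerate(min_phases, start=1):
        omega = k * w0
        if omega <= cutoff:
            phases.append(ph + allpass_phase(omega, beta1, beta2))
        else:
            phases.append(rng.uniform(-math.pi, math.pi))
    return phases

# Ten harmonics of 0.3 rad/sample; harmonics above 1.0 rad/sample get random phase.
rng = random.Random(0)
phases = harmonic_phases([0.0] * 10, 0.3, 1.0, 0.95, 0.8, rng)
```

A limit that depends on the nature of the speech would be handled by computing `cutoff` afresh for each frame.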

The arrangements so far described for FIG. 9 are designed primarily for voiced speech. To accommodate unvoiced speech, the coder has, in conventional manner, a voiced/unvoiced speech detector 92 which causes the decoder to switch, via a switch 93, between the excitation generator 64 and a noise generator whose amplitude is controlled by a gain signal from the coder.

Although the adjustment has been illustrated by addition of phase values, this is not the only way of achieving the desired result; for example the synthesis filter 65 could instead be followed (or preceded) by an all-pass filter having the response of Equation (15).

It should be noted that, although the decoders described have been presented in terms of the decoding of signals coded and transmitted thereto, they may equally well serve to generate speech from coded signals stored and later retrieved, i.e. they could form part of a speech synthesiser.

What is claimed is:
 1. A decoder for speech signals comprising: means for receiving magnitude spectral information for synthesis of a time-varying signal; means for computing, from the magnitude spectral information, phase spectrum information corresponding to a minimum phase filter which has a magnitude spectrum corresponding to the magnitude spectral information; means for generating, from the magnitude spectral information and the phase spectral information, the time-varying signal; and phase adjustment means operable to modify the phase spectrum of the signal, the phase adjustment means being operable to adjust the phase in accordance with the transfer function of an all-pass filter having, in a z-plane representation, at least one pole outside the unit circle.
 2. A decoder according to claim 1 in which the phase adjustment means are arranged in operation to modify the phase of the signal after generation thereof.
 3. A decoder according to claim 1 in which the phase adjustment means are operable to adjust the phase in accordance with the transfer function of an all-pass filter having, in a z-plane representation, two real zeros at positions β₁, β₂ inside the unit circle and two poles at positions 1/β₁, 1/β₂ outside the unit circle.
 4. A decoder according to claim 1 in which the position of the or each pole is constant.
 5. A decoder according to claim 1 in which the adjustment means are arranged in operation to vary the position of the or a said pole as a function of pitch period information received by the decoder.
 6. A decoder for decoding speech signals comprising information defining the response of a minimum phase synthesis filter and, for synthesis of an excitation signal, magnitude spectral information, the decoder comprising: means for generating, from the magnitude spectral information, an excitation signal; a synthesis filter controlled by the response information and connected to filter the excitation signal; and phase adjustment means for estimating a phase-adjustment signal to modify the phase of the signal, the phase adjustment means being operable to adjust the phase in accordance with the transfer function of an all-pass filter having, in a z-plane representation, at least one pole outside the unit circle.
 7. A decoder according to claim 6 in which the excitation generating means are connected to receive the phase adjustment signal so as to generate an excitation having a phase spectrum determined thereby.
 8. A decoder according to claim 6 in which the phase adjustment means are arranged in operation to modify the phase of the signal after generation thereof.
 9. A decoder according to claim 6 in which the phase adjustment means are operable to adjust the phase in accordance with the transfer function of an all-pass filter having, in a z-plane representation, two real zeros at positions β₁, β₂ inside the unit circle and two poles at positions 1/β₁, 1/β₂ outside the unit circle.
 10. A decoder according to claim 6 in which the position of the or each pole is constant.
 11. A decoder according to claim 6 in which the adjustment means are arranged in operation to vary the position of the or a said pole as a function of pitch period information received by the decoder.
 12. A method of coding and decoding speech signals, comprising: (a) generating signals representing the magnitude spectrum of the speech signal; (b) receiving the signals; (c) generating from the received signals a synthetic speech signal having a magnitude spectrum determined by the received signals and having a phase spectrum which corresponds to a transfer function having, when considered as a z-plane plot, at least one pole outside the unit circle.
 13. A method according to claim 12 in which the phase spectrum of the synthetic speech signal is determined by computing a minimum-phase spectrum from the received signals and forming a composite phase spectrum which is the combination of the minimum-phase spectrum and a spectrum corresponding to the said pole(s).
 14. A method according to claim 12 in which the signals include signals defining a minimum-phase synthesis filter and the phase spectrum of the synthetic speech signal is determined by the defined synthesis filter and by a phase spectrum corresponding to the said pole(s).